Introduction

We consume many genres and subgenres of music every day. Developing an understanding of this music is a widely underappreciated accomplishment; the programmers at Spotify, however, deemed it a necessary feature to integrate into their platform and business model. In doing so, they created an API that lets developers integrate Spotify’s data into applications and analytics. Other companies, such as Genius and LastFM, have created APIs as well, putting lyrics and genre tags, respectively, into developers’ hands.

   In this work, a preliminary investigation is made of the past decade of music, taken from Billboard’s Top 100 songs. Using the Spotify API, these songs and their related albums were pulled along with the track features that describe the mood, feeling, or characteristics of each track. This is a fairly large dataset of songs and song features that opens up many possibilities for analysis. Some questions that could be asked are: How does music happiness change across the world? How does happiness relate to other key features in music? Can we form a model that shows this relationship? This analysis, and the further explorations found at this repo, take up these questions and more.

Libraries and Imports

#plotting and wrangling
library(tidyverse)
library(highcharter)
library(factoextra)
library(patchwork)
library(sf)

#ml/modeling
library(caret)

#working with time variables
library(lubridate)

#kable
library(kableExtra)

#setting the seed for reproducibility
set.seed(543)

Viewing the Data

What is the relationship between energy and valence from the last decade?

From the Spotify API, there are a number of measures defining the features found in music. These features are as follows (taken from the official documentation):

**acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
**danceability**: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
**energy**: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
**instrumentalness**: Predicts whether a track contains no vocals.
**liveness**: Detects the presence of an audience in the recording.
**loudness**: The overall loudness of a track in decibels (dB).
**speechiness**: Speechiness detects the presence of spoken words in a track.
**valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.

One might think from the descriptions that energy and valence must be related. Let’s plot them on a graph to see.

   From this graph it looks as though these two variables are in fact positively correlated. Some of the outliers show that a song can be very happy without being very energetic. This makes sense for songs that have very positive lyrics but are not especially upbeat. Hovering over Whack World in the album list shows one outlier, which happens to fall on an album that generally follows the trend stated before.
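
The visual impression can also be checked numerically. The following is a minimal sketch, not the code behind the plot: it assumes a data frame named `tracks` (a hypothetical name) with `energy` and `valence` columns, and fills it with simulated placeholder values so that the snippet runs on its own.

```r
# Placeholder data: simulated values standing in for the real Spotify
# features. `tracks` and everything in it are assumptions for illustration.
set.seed(543)
n <- 200
energy <- runif(n)
valence <- pmin(pmax(0.3 + 0.4 * energy + rnorm(n, sd = 0.15), 0), 1)
tracks <- data.frame(energy = energy, valence = valence)

# Pearson correlation between energy and valence
r <- cor(tracks$energy, tracks$valence)

# Significance test and confidence interval for that correlation
ct <- cor.test(tracks$energy, tracks$valence)

# Candidate outliers: happy but low-energy tracks (thresholds are arbitrary)
happy_low_energy <- subset(tracks, valence > 0.7 & energy < 0.4)

print(r)
print(ct$conf.int)
print(nrow(happy_low_energy))
```

On the real data, the same three steps would give the strength of the positive trend and a short list of tracks worth inspecting by hand.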

How happy was music around the globe?

Being able to see how happy certain albums are is nice, but what about albums across the globe? How does the average happiness of music change as we move overseas?

   According to the data, among the songs/albums from Billboard’s Top 100 of the decade that are available in each country, the African countries score the highest in happiness. Keep in mind that this is based on availability, not necessarily on these countries’ own best music of the decade. It stands to reason that some music is not available in all of these countries, and this may skew the results.
   What might be more fruitful to look at is how these values change within continents. Take for example, Europe. The eastern countries in Europe tend to have much happier music when compared to their western counterparts. Moldova is the only exception to this rule. Australia and Japan have some of the saddest music from Billboard’s charts, while Venezuela has some of the happiest.
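
The availability caveat is easier to see in a small example. The sketch below assumes availability has been flattened to one row per track and market (Spotify track objects carry an `available_markets` list, though how it was reshaped for the map is not stated). The data frames `tracks` and `availability` and all of their values are made up for illustration.

```r
# Made-up example values, for illustration only
tracks <- data.frame(
  track   = c("a", "b", "c"),
  valence = c(0.2, 0.5, 0.8)
)

# One row per (track, market) pair in which the track is available
availability <- data.frame(
  track  = c("a", "b", "c", "a", "b", "b", "c"),
  market = c("US", "US", "US", "JP", "JP", "VE", "VE")
)

# Attach valence to every (track, market) pair
available_tracks <- merge(availability, tracks, by = "track")

# Mean valence per market, computed only over the tracks available there
by_market <- aggregate(valence ~ market, data = available_tracks, FUN = mean)

# Number of tracks behind each mean, to show how thin some averages are
counts <- table(available_tracks$market)
by_market$n_tracks <- as.vector(counts[as.character(by_market$market)])

# Happiest markets first
by_market <- by_market[order(-by_market$valence), ]
print(by_market)
```

Here the three markets end up with different averages purely because different subsets of the same three tracks are available in each, which is the skew described above.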

What is the relationship between happiness and the other variables?

There are a few parameters in the Spotify API that are computed as functions of other parameters found in the data. These parameters are energy, loudness, and acousticness. For example, loudness is used in the calculations of energy. For this reason, these terms will be modeled as interaction terms in the regression model. The plots demonstrating these relationships are omitted here for this reason as well. The documentation expounding on these parameters is found here.

   The model is a GLM (here an ordinary linear model, as the lm call in the output shows) fit with k-fold cross-validation and without any standardization of the parameters. The interaction terms are included. The objective is to understand how well these other parameters perform as predictors of valence, and how well they can be used in an inferential model.
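
As a rough illustration of this setup, the sketch below runs k-fold cross-validation by hand in base R. With `caret`, the same idea is usually expressed through `train()` with `trainControl(method = "cv")`; base R is used here only so that the snippet runs without extra packages. The data are simulated, and the fold count of 10 is an assumption, since the text does not state it.

```r
# Synthetic stand-in for the track features; all values are simulated
# and the column names are assumed to match the Spotify feature names.
set.seed(543)
n <- 300
tracks <- data.frame(
  danceability     = runif(n),
  energy           = runif(n),
  loudness         = runif(n, min = -20, max = 0),  # dB
  acousticness     = runif(n),
  speechiness      = runif(n),
  instrumentalness = runif(n),
  liveness         = runif(n)
)
tracks$valence <- 0.2 + 0.4 * tracks$danceability +
  0.3 * tracks$energy + rnorm(n, sd = 0.15)

# energy * loudness * acousticness expands to the three main effects,
# all two-way interactions, and the three-way interaction.
model_formula <- valence ~ danceability + energy * loudness * acousticness +
  speechiness + instrumentalness + liveness

k <- 10  # number of folds (assumed)
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, score on the held-out fold, repeat for each fold
rmse_by_fold <- vapply(seq_len(k), function(i) {
  train <- tracks[fold_id != i, ]
  test  <- tracks[fold_id == i, ]
  fit   <- lm(model_formula, data = train)
  sqrt(mean((test$valence - predict(fit, newdata = test))^2))
}, numeric(1))

cv_rmse <- mean(rmse_by_fold)

# Refit on all rows; this is the model whose coefficients get interpreted
final_fit <- lm(model_formula, data = tracks)
print(cv_rmse)
print(names(coef(final_fit)))
```

The summary that follows is the fit on the real data, not output from this sketch.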
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51752 -0.12028 -0.01606  0.11570  0.64027 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -0.357405   0.094252  -3.792 0.000157 ***
## danceability                    0.511593   0.034168  14.973  < 2e-16 ***
## energy                          0.653016   0.113099   5.774 1.00e-08 ***
## loudness                       -0.013510   0.008845  -1.527 0.126948    
## acousticness                    0.302115   0.120668   2.504 0.012432 *  
## speechiness                     0.188917   0.047946   3.940 8.64e-05 ***
## instrumentalness               -0.041068   0.022154  -1.854 0.064040 .  
## liveness                        0.052949   0.037499   1.412 0.158218    
## `energy:loudness`               0.019084   0.012955   1.473 0.140994    
## `energy:acousticness`          -0.148657   0.187236  -0.794 0.427390    
## `loudness:acousticness`         0.011871   0.010295   1.153 0.249084    
## `energy:loudness:acousticness`  0.003036   0.016958   0.179 0.857961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1844 on 1125 degrees of freedom
## Multiple R-squared:  0.343,  Adjusted R-squared:  0.3365 
## F-statistic: 53.39 on 11 and 1125 DF,  p-value: < 2.2e-16
   From the model, danceability, energy, and speechiness are significant predictors of valence (p < 0.001), as is the intercept, with acousticness significant at the 0.05 level. Even with the interaction terms included, energy remains highly significant (p = 1.00e-08); danceability has the smallest p-value (< 2e-16). None of the interaction terms is significant at the 0.05 level. Unfortunately, the parameters listed explain only about 34% of the variance in valence (R-squared = 0.343). This is likely due to missing information and imperfect modeling of the individual parameters. The coefficients point in the direction described above for energy and valence. That is, with loudness and acousticness at zero, a one-unit increase in energy (the full 0 to 1 range) corresponds to an expected increase of about 0.65 in valence; because of the interaction terms, the slope differs at other values of loudness and acousticness.
    Extracting these coefficients should help describe the relationship between valence and these variables. Plotting the model below shows how close this is to modeling what is actually happening.

| Term | Estimate |
|---|---|
| (Intercept) | -0.3574047 |
| danceability | 0.5115930 |
| energy | 0.6530159 |
| loudness | -0.0135102 |
| acousticness | 0.3021149 |
| speechiness | 0.1889168 |
| instrumentalness | -0.0410682 |
| liveness | 0.0529490 |
| energy:loudness | 0.0190843 |
| energy:acousticness | -0.1486566 |
| loudness:acousticness | 0.0118714 |
| energy:loudness:acousticness | 0.0030356 |
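
Because the model includes interactions between energy, loudness, and acousticness, the energy coefficient on its own is the slope only when loudness and acousticness are both zero. The sketch below uses the coefficients listed above to compute the slope of valence with respect to energy at other values. The helper `energy_slope` and the example inputs (-6 dB, acousticness 0.1) are hypothetical and chosen only for illustration.

```r
# Coefficients involving energy, copied from the fitted model above
coefs <- c(
  "energy"                       = 0.6530159,
  "energy:loudness"              = 0.0190843,
  "energy:acousticness"          = -0.1486566,
  "energy:loudness:acousticness" = 0.0030356
)

# Partial derivative of predicted valence with respect to energy,
# evaluated at a given loudness (dB) and acousticness (0 to 1).
# `energy_slope` is a hypothetical helper, not part of the original analysis.
energy_slope <- function(loudness, acousticness) {
  unname(
    coefs["energy"] +
      coefs["energy:loudness"] * loudness +
      coefs["energy:acousticness"] * acousticness +
      coefs["energy:loudness:acousticness"] * loudness * acousticness
  )
}

slope_at_zero  <- energy_slope(0, 0)     # equals the energy coefficient
slope_example  <- energy_slope(-6, 0.1)  # example inputs are assumptions

print(slope_at_zero)
print(slope_example)
```

At the example inputs the slope is roughly 0.52 rather than 0.65, so a 0.1 increase in energy would correspond to an expected increase of about 0.05 in valence there. The interaction terms are not individually significant, so these differences should be read loosely.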

Results/Conclusion

The model performs reasonably well. As stated above, the low R-squared value indicates that important data needed for better prediction are missing from this analysis. This could be anything from more features to artist, album, and genre metadata. The latter are available in the dataset I created for this analysis and will be investigated in future analyses. Other items I wanted to include here were results from an out-of-sample analysis of the data, more exploratory analysis of individual artists and their albums, more maps showing different parameters, and a Shiny app turning these plots into more interactive visualizations in which the parameters being shown or compared can be changed and adjusted. These do not fit into the scope of this preliminary analysis and were thus not included here.